Oh, great. Going to learn about "cgroups".
@janvorli I'm porting your change in dotnet/runtime#980 and hitting an assert:
I added some instrumentation in 12b821d and I'm getting:
In particular, the root and the path relative to mount are the same (in your examples in cgroup.cpp, there should be an extra directory in the latter). Do you know if I'm just holding it wrong, or if the code in the GC doesn't handle the way Docker is configured for this particular build pool?
The assert was wrong; it didn't take the unnamed group case into account. And it was never detected, as the equivalent code in the coreclr PAL's
* Add parentheses to parameters. * Revert changes to IS_UNWINDING. Commit migrated from dotnet/coreclr@ce29457
+ When hardlimit is specified, we should only retry when we didn't fail due to commit failure - if commit failed, it means we simply didn't have as much memory as the hardlimit specified; we should throw OOM in this case. + Added some diag info around OOM history to help with future diagnostics. Commit migrated from dotnet/coreclr@7dca41f
This is the 1st part of BGC tuning using FL (free list) with a PID loop. Historically, the most significant factor for triggering a BGC has been the allocation budget. This experimental feature triggers based on the FL in gen2/3 with a PID loop (by default we only use PI, no D), so we use the allocation calculated based on the FL to determine when to trigger the next BGC.

The goal of the PI feedback loop

The end goal of the tuning is to keep a constant physical ML (memory load). The ML goal is specified as a percentage (meaning the percent of physical memory in use), and we aim to do as few BGCs as possible to achieve this memory load. This is most useful when you are seeing that a large percent of free space in gen2 isn't getting used when a BGC is triggered, or when you have plenty of memory but the GC heap size only takes up a small percentage and you want to delay BGCs to reduce the CPU consumed by BGC.

Enable the FL tuning PI loop

Since this is an experimental feature, it's disabled by default. To enable it, set this env var:

set COMPlus_BGCFLTuningEnabled=1

When FL tuning is enabled, by default we set the ML goal to 75%. You can change it with this env var:

set COMPlus_BGCMemGoal=X

Note that, as with any COMPlus var, the value is interpreted as a hex number, not dec.

Perf consideration of the current PI loop

Of course there's always perturbation. From BGC's POV there are 2 categories of perturbation: 1) changes in GC's own perf characteristics, for example, suddenly we see a lot of pins from gen1 that get promoted into gen2; 2) non-GC factors - this could be a sudden increase of native memory usage in the process, or other processes on the same machine simply increasing/decreasing their memory usage. And generally we don't want to pick something very high like 90%, because it's hard to react when memory is tight - GC would need to compact, and currently BGC does not compact.

So for now we have to assume that "retracting the heap is difficult", which means we want our PI loop to be fairly conservative. So we actually have another PI loop (the inner loop) to make sure the "sweep flr" is at a reasonable value. "Sweep flr" is the FLR (Free List Ratio) before BGC rebuilds the free list - so you can think of this as the smallest FLR during a BGC. The inner loop has a "sweep flr" goal of 20% by default, which is pretty conservative. When we can incrementally compact, I would expect to reduce this by a fair amount. Another possibility is to not set this as a fixed number and instead calculate a reasonable one dynamically based on how we observe the free list being used. Of course, just because BGC does not compact doesn't mean the total gen2 size cannot get smaller; it could get smaller simply by objects at the end of gen2 naturally dying.

+ Initialization of the PI loops

We have to have some way to get this whole thing started, so I usually do a few BGCs to reach 2/3 of the mem load goal, then start using the PI loops to decide when to trigger the next BGC.

+ Panic mode

I use a very simple rule to decide whether I should panic, ie, do an NGC2: if we observe the memory load is (goal + N%), where N is just a number we determine, we do an NGC2. This actually turned out to give decent results because we give it ample opportunity to allow some oscillation around the goal (instead of panicking prematurely).

+ Implementation notes

When FL tuning is not enabled there should be no effect. We record things when BGC starts, when BGC sweep ends and when BGC ends. I have other mechanisms like the D term, FF (feed forward) and smoothing; I have experimented with them in the past. They are currently not enabled by default but can be enabled with COMPlus env vars.

Currently this doesn't work great with LOH because we have a fundamental limitation: if we give free space to gen2, it's difficult to give it to LOH. One thing the user could do is adjust the LOH threshold so most of the large object allocations happen in gen2. Commit migrated from dotnet/coreclr@30bb5b5
Implement card marking stealing for better work balance in Server GC. One of the last stages in the mark_phase is to mark objects referenced from older generations. This stage is often slow compared to the other stages, and it is also often somewhat unbalanced, i.e. some GC threads finish their work significantly sooner than others. The change also applies to the relocate_phase, but that phase usually takes significantly less time. This change implements thread-safe enumeration of older generations by dividing them into chunks (2 MB in 64-bits, 1 MB in 32-bits), and arranges it so threads finishing work early will help on other heaps. Each thread grabs a chunk and then looks through the card table section corresponding to this chunk. When it's done with a chunk, it grabs the next one and so on. There are changes at multiple levels: - at the top level, mark_phase and relocate_phase contain changes to check for work already done for both the heap associated with the thread and other heaps. - these routines call mark_through_cards_for_segments and mark_through_cards_for_large_objects which contain code to walk through the older generations in chunks. - ultimately card_marking_enumerator::move_next implements the thread safe enumeration, supplying chunks, and gc_heap::find_next_chunk supplies a chunk where all card bits are set. Commit migrated from dotnet/coreclr@5ca444c
…reclr#27384) * Use half-fences for volatile loads/stores on Windows ARM64 * Updated Volatile.h in gc/env as well. Unified type names and warning suppression. Commit migrated from dotnet/coreclr@c128dba
Remove gcc nonnull-compare suppression and impossible conditions. > error: nonnull argument 'this' compared to NULL > [-Werror=nonnull-compare] Commit migrated from dotnet/coreclr@4afbe3a
* Fix available memory extraction on Linux. GlobalMemoryStatusEx in the PAL returns the number of free physical pages in the ullAvailPhys member. But there are additional pages allocated as buffers and caches that get released under memory pressure, so they are effectively available too. This change extracts the available memory on Linux from the /proc/meminfo MemAvailable row, which the kernel reports as the most precise amount of available memory. Commit migrated from dotnet/coreclr@859f464
Normalize trailing whitespaces in frequently changing files (docs and sources) Commit migrated from dotnet/coreclr@ed5dc83
* Two simple fixes to suspension issues seen in GCPerfSim: - Insert allow_fgc() call in background_mark_simple - this fixes the cases where there are a ton of GC handles referencing simple objects not containing pointers. - Change PING_JIT_TIMEOUT constant from 10 milliseconds to 1 millisecond. This fixes the case where return address hijacking doesn't work quickly, because the hijacked thread doesn't return (e.g. because it's in a loop doing further calls). In this case we have to retry the hijack, and changing the timeout constant makes this happen more quickly. Commit migrated from dotnet/coreclr@fab7aa2
Commit migrated from dotnet/coreclr@96e08af
* Fixing an assert in card stealing. * use ifdef to be consistent with other cases in the file * PR feedback Commit migrated from dotnet/coreclr@7aa67a9
* Changes to set gen0 bricks always. This reduces the time spent in find_first_object when finding the start of objects for marking interior pointers. * Revert "Changes to set gen0 bricks always. This reduces the time spent in find_first_object when finding the start of objects for marking interior pointers." This reverts commit dotnet/coreclr@9d53ff9. * Two fixes to speed up suspension for foreground GCs while background GC is in progress: - In background_mark_simple1, check g_fSuspensionPending and if it is set, save the state of the work and restart the loop - this will call allow_fgc() and thus allow a foreground GC to take place. - In revisit_written_page, call allow_fgc() at the end - this allows a foreground GC to happen whenever we are done with revisiting a written page. * Addressed code review feedback - use counter instead of testing g_fSuspensionPending directly. Commit migrated from dotnet/coreclr@a7678f5
* Suppress on clang only * Fix integer conversion * Extra qualifier * Suppress warning * Extra qualifier * Signedness issues * Correct offsetof * Offsetof doesn't support non-constant values * Conversion errors * Move the comment too * Fix assembly warning * size is not constant * Fix comment type * Fix endmacro name * Use OFFSET_NONE constant Commit migrated from dotnet/coreclr@e8bbcf1
…clr#27846) Unifying the type used for number of heaps/locks/threads It is the same small number and should be just int. NOTE: Initial number of allocated blocks per generation is also the same as number of heaps. Commit migrated from dotnet/coreclr@7405e43
* Fix getting affinity set on MUSL on Jetson TX2. The code in PAL_GetCurrentThreadAffinitySet relied on the number of processors reported as configured in the system always being larger than the maximum CPU index. However, it turns out that this is not true on some devices/distros. The Jetson TX2 reports CPUs 0, 3, 4 and 5 in the affinity mask; CPUs 1 and 2 are never reported. GLIBC reports 6 as the number of configured CPUs, but MUSL reports just 4. PAL_GetCurrentThreadAffinitySet was using the number of CPUs reported as configured as the upper bound for scanning the affinity set, so on Jetson TX2 the affinity mask returned had just two bits set while there were 4 CPUs. That triggered an assert in GCToOSInterface::Initialize. This change fixes that by looping over all CPU indices in the affinity set. A similar fix went into GetProcessorForHeap and related code in gcenv.unix.cpp.
While named cgroups work fine outside of a docker container, they weren't working when created and used inside a docker container. The problem was caused by the fact that the hierarchy root extracted from /proc/self/mountinfo and the cgroup path extracted from /proc/self/cgroup are not equal for named groups; they just share the same prefix. The cgroups handling code was not expecting this case and ended up building the final cgroup path incorrectly (including the common part of the path). This change fixes it by checking for a matching prefix of the paths instead of comparing the whole paths.
The amount of strong name support that CoreCLR needs is very small (really just a method to convert public key to public key token). It is not worth it to build a separate .lib for just this single method. Fold the strong name APIs into metadata and change the API to return HRESULT.
The allocate_in_free code path in allocate_in_expanded_heap incorrectly calculated the large (double) alignment padding size when limiting the plug size (SHORT_PLUGS) if set_padding_on_saved_p was true:

set_padding_in_expand (old_loc, set_padding_on_saved_p, pinned_plug_entry); // sets the padding flag on the saved plug
...
pad += switch_alignment_size (is_plug_padded (old_loc)); // reads the padding flag from the old (different!) plug

That caused an access violation during a later heap walk, since the g_gc_pFreeObjectMethodTable pointer marking the gap was not placed at the right address.
* gc, pal: Avoid linking libnuma.so on non-NUMA systems. In an effort to reduce the system calls performed during initialization, this seemed like another low-hanging fruit: only link in libnuma.so if the system has more than one NUMA node. * gc, pal: Remove unneeded dlsym() calls. The result of dlsym() is never assigned to any variable, and the looked-up function is never called, so it's safe to remove these dlsym() calls.
Improve comments and fix a couple of places where large alignment was not taken into account. Set pad_in_front to zero if SHORT_PLUGS is not defined.
* refactoring * ploh-->uoh * fix Linux build PR feedback * some PR feedback * more PR feedback more PR feedback * more PR feedback * address comments in UpdatePreGCCounters * removed a confusing comment * stress fix in background sweep * use `saved_sweep_ephemeral_seg` in background sweep. * fixed `GCPerHeapHistory_V3` event. * re-implemented dotnet/runtime#2103 on top of refactoring (it was hard to merge)
* Use gcenv.windows.inl on Windows hosts * Make two-way pipe implementation depend on host. We have no ability to implement a cross-OS pipe, so allow the two-way pipe implementation to depend on the host. * Use empty debug macro implementation on Windows hosts. * Make HMODULE type depend on host compiler * Configure long file path wrappers based on host * Configure memcpy based on host * Configure FreeLibrary based on host * Configure host utility functions based on host. VirtualAlloc, CPU count, and NUMA config make the most sense as host queries; configure based on host. * Configure exception holder based on host * Prefer #if HOST_WINDOWS for longfilepathwrappers.cpp * Prefer #if HOST_WINDOWS for src/coreclr/src/inc/holder.h * Do not include GCToOSInterface::GetPageSize() in DAC * Remove JIT64_BUILD * !HOST_UNIX -> HOST_WINDOWS
Previously, there was an additional check `if (!join_struct.wait_done)` at the beginning of an rjoin. This usually had no effect as the join isn't usually done yet. But if thread A is the first to enter the join and finishes before thread B starts, then B will never enter the `if`. The only effect this had was that B would not fire join events in these rare cases. (After that we check `join_struct.wait_done` again anyway.) As of this PR we will always fire the events, which makes them easier to analyze consistently. Before this PR, looking at the join events for a single thread would show no traces of the rjoin happening, except for an extra restart event from the other thread.
…rity (#32033) Fixes #31995
* Introducing Pinned Object Heap * PR feedback * reverted a test-only change
* The main difference of `AllocLHeap` is that it uses a per-heap acontext instead of a per-thread one. There is no advantage in that, and it results in allocations being under-reported in `GetAllocatedBytesForCurrentThread`. This change unifies to one allocation entry point - `Alloc` (and its `AllocAlign8` variety) * Removed AllocAlign8 * PR feedback - some refactoring to merge duplicate calls to `Alloc` * Split an `ifdef/else` in two to not cross code blocks. * No need to update `GC_INTERFACE_MAJOR_VERSION` more than once per release, and we already did for this one.
…(#33597) * Delete TlsIdx_XXX * Delete EnableTerminationOnHeapCorruption * Delete redundant cant stop tracking * Workaround compiler bug
* Fix native components build for Android * Add cmake introspection for pthread_setcancelstate * Address CR feedback * Use calculated eth speed for Android * Use #ifdef FEATURE_EVENT_TRACE
* Adding API for POH allocations and propagating flags all the way to Alloc. * make `AllocateUninitializedArray` and `AllocateArray` public * Added NYI implementations to Mono * moved tests to libraries * Actually use POH and more tests. * Disable tests for the new API on mono * mop up remaining TODOs * Fix build-breaking whitespace. * Mono tabs and mark heavier tests as [Outerloop] * Mono space before opening parens and braces * Refactored AllocateArray * PR feedback * XML doc comments
runtime\src\coreclr\src\gc\gc.cpp(4343,1): warning C5208: unnamed class used in typedef name cannot declare members other than non-static data members, member enumerations, or member classes
There is an assert in FindCgroupPath that fires when hierarchy_root and cgroup_path_relative_to_mount are equal, which is the case for cgroups that are not named. This assert checks that the common path in those two variables ends with / which is only the case with named groups. We have never seen this assert to fire because cgroups initialization happens before the debugger support initialization in PAL and so asserts are disabled at that point. I am going to fix that in a separate PR. This problem was discovered with the standalone GC where the assert actually fires as it uses a plain C assert function. This change fixes the assert to account for the case when both the paths are the same.
CI is not triggering - need to check CI status before merging.
Skipping commits:
dotnet/runtime@631cb1b - I think we need to pull the Perl script to generate ETW boilerplate into the open.
dotnet/runtime@fcd862e - not going to mess with that right now
dotnet/runtime@6c1f8ad - I think this is for another commit that we skipped
dotnet/runtime@f99b4fe - more ETW